1 Introduction
1.1 Background
The housing market in Singapore comprises both public and private sectors. Public housing serves individuals with monthly household incomes up to S$14,000, while those with higher incomes typically turn to the private residential market.
1.2 Project Objectives
In this exercise, Exploratory Data Analysis (EDA) methods and ggplot functions are used to explore:
the distribution of
the relationship
2.Getting started
2.1 Importing libraries
We load the following R packages using the pacman::p_load() function:
#|eval=FALSE
pacman::p_load(tidyverse, ggstatsplot, ggridges, plotly, dplyr,wesanderson,lubridate)2.2 loading data
The dataset used in the exercise is transaction data of REALIS
We import this dataset by using read.csv() function
files1<-read.csv("data/ResidentialTransaction20240308160536.csv")
files2<-read.csv("data/ResidentialTransaction20240308160736.csv")
files3<-read.csv("data/ResidentialTransaction20240308161009.csv")
files4<-read.csv("data/ResidentialTransaction20240308161109.csv")
files5<-read.csv("data/ResidentialTransaction20240414220633.csv")2.3 Glimpse of data
Using the glimpse() function, we see the first 5 rows of five datasets (files1 to files5). These datasets likely contain private residential property transaction data, possibly segmented by different criteria such as date, type, and price. Further exploration is needed to understand the meaning of the fields, data types, presence of missing values, and how to integrate these datasets for analysis and visualization.
Show the code
head(files1,5) Project.Name Transacted.Price.... Area..SQFT. Unit.Price....PSF.
1 THE REEF AT KING'S DOCK 2,317,000 882.65 2,625
2 URBAN TREASURES 1,823,500 882.65 2,066
3 NORTH GAIA 1,421,112 1,076.4 1,320
4 NORTH GAIA 1,258,112 1,033.34 1,218
5 PARC BOTANNIA 1,280,000 871.88 1,468
Sale.Date Address Type.of.Sale Type.of.Area
1 01 Jan 2023 12 HARBOURFRONT AVENUE #05-32 New Sale Strata
2 02 Jan 2023 205 JALAN EUNOS #08-02 New Sale Strata
3 02 Jan 2023 29 YISHUN CLOSE #08-10 New Sale Strata
4 02 Jan 2023 45 YISHUN CLOSE #07-42 New Sale Strata
5 03 Jan 2023 12 FERNVALE STREET #06-16 Resale Strata
Area..SQM. Unit.Price....PSM. Nett.Price... Property.Type
1 82 28,256 - Condominium
2 82 22,238 - Condominium
3 100 14,211 - Executive Condominium
4 96 13,105 - Executive Condominium
5 81 15,802 - Condominium
Number.of.Units Tenure Completion.Date
1 1 99 yrs from 12/01/2021 Uncompleted
2 1 Freehold Uncompleted
3 1 99 yrs from 15/02/2021 Uncompleted
4 1 99 yrs from 15/02/2021 Uncompleted
5 1 99 yrs from 28/12/2016 2022
Purchaser.Address.Indicator Postal.Code Postal.District Postal.Sector
1 HDB 97996 4 9
2 Private 419535 14 41
3 HDB 269343 27 26
4 HDB 269294 27 26
5 HDB 797391 28 79
Planning.Region Planning.Area
1 Central Region Bukit Merah
2 East Region Bedok
3 North Region Yishun
4 North Region Yishun
5 North East Region Sengkang
Show the code
head(files2,5) Project.Name Transacted.Price.... Area..SQFT. Unit.Price....PSF.
1 THE GAZANIA 1,528,000 678.13 2,253
2 THE GAZANIA 1,938,000 958 2,023
3 ONE PEARL BANK 2,051,000 699.66 2,931
4 URBAN TREASURES 1,850,700 882.65 2,097
5 HYLL ON HOLLAND 2,021,500 699.66 2,889
Sale.Date Address Type.of.Sale Type.of.Area Area..SQM.
1 01 Apr 2023 15 HOW SUN DRIVE #02-31 New Sale Strata 63
2 01 Apr 2023 7 HOW SUN DRIVE #01-12 New Sale Strata 89
3 01 Apr 2023 1 PEARL BANK #32-16 New Sale Strata 65
4 01 Apr 2023 205 JALAN EUNOS #05-05 New Sale Strata 82
5 01 Apr 2023 97 HOLLAND ROAD #10-25 New Sale Strata 65
Unit.Price....PSM. Nett.Price... Property.Type Number.of.Units
1 24,254 - Condominium 1
2 21,775 - Condominium 1
3 31,554 - Apartment 1
4 22,570 - Condominium 1
5 31,100 - Condominium 1
Tenure Completion.Date Purchaser.Address.Indicator
1 Freehold 2022 N.A
2 Freehold 2022 Private
3 99 yrs from 01/03/2019 Uncompleted Private
4 Freehold Uncompleted HDB
5 Freehold Uncompleted Private
Postal.Code Postal.District Postal.Sector Planning.Region Planning.Area
1 538545 19 53 North East Region Serangoon
2 538530 19 53 North East Region Serangoon
3 169016 3 16 Central Region Outram
4 419535 14 41 East Region Bedok
5 278541 10 27 Central Region Bukit Timah
Show the code
head(files3,5) Project.Name Transacted.Price.... Area..SQFT. Unit.Price....PSF. Sale.Date
1 MYRA 1,658,000 667.37 2,484 01 Jul 2023
2 NORTH GAIA 1,449,000 1,076.4 1,346 01 Jul 2023
3 NORTH GAIA 1,365,000 1,076.4 1,268 01 Jul 2023
4 NORTH GAIA 1,231,000 958 1,285 01 Jul 2023
5 NORTH GAIA 1,272,000 1,001.05 1,271 01 Jul 2023
Address Type.of.Sale Type.of.Area Area..SQM.
1 9 MEYAPPA CHETTIAR ROAD #02-07 New Sale Strata 62
2 27 YISHUN CLOSE #10-06 New Sale Strata 100
3 27 YISHUN CLOSE #05-06 New Sale Strata 100
4 35 YISHUN CLOSE #08-25 New Sale Strata 89
5 45 YISHUN CLOSE #09-45 New Sale Strata 93
Unit.Price....PSM. Nett.Price... Property.Type Number.of.Units
1 26,742 - Apartment 1
2 14,490 - Executive Condominium 1
3 13,650 - Executive Condominium 1
4 13,831 - Executive Condominium 1
5 13,677 - Executive Condominium 1
Tenure Completion.Date Purchaser.Address.Indicator
1 Freehold Uncompleted N.A
2 99 yrs from 15/02/2021 Uncompleted HDB
3 99 yrs from 15/02/2021 Uncompleted HDB
4 99 yrs from 15/02/2021 Uncompleted HDB
5 99 yrs from 15/02/2021 Uncompleted HDB
Postal.Code Postal.District Postal.Sector Planning.Region Planning.Area
1 358456 13 35 Central Region Toa Payoh
2 769342 27 76 North Region Yishun
3 769342 27 76 North Region Yishun
4 769299 27 76 North Region Yishun
5 769294 27 76 North Region Yishun
Show the code
head(files4,5) Project.Name Transacted.Price.... Area..SQFT. Unit.Price....PSF. Sale.Date
1 LEEDON GREEN 1,749,000 538.2 3,250 01 Oct 2023
2 LIV @ MB 3,148,740 1,453.14 2,167 01 Oct 2023
3 MORI 2,422,337 1,259.39 1,923 01 Oct 2023
4 THE ARDEN 1,330,000 721.19 1,844 01 Oct 2023
5 LENTOR MODERN 2,237,000 1,130.22 1,979 01 Oct 2023
Address Type.of.Sale Type.of.Area Area..SQM.
1 26 LEEDON HEIGHTS #11-08 New Sale Strata 50
2 114A ARTHUR ROAD #01-01 New Sale Strata 135
3 223 GUILLEMARD ROAD #05-21 New Sale Strata 117
4 6 PHOENIX ROAD #01-18 New Sale Strata 67
5 3 LENTOR CENTRAL #05-03 New Sale Strata 105
Unit.Price....PSM. Nett.Price... Property.Type Number.of.Units
1 34,980 - Condominium 1
2 23,324 - Condominium 1
3 20,704 - Apartment 1
4 19,851 - Apartment 1
5 21,305 - Apartment 1
Tenure Completion.Date Purchaser.Address.Indicator
1 Freehold Uncompleted Private
2 99 yrs from 23/11/2021 Uncompleted Private
3 Freehold Uncompleted Private
4 99 yrs from 14/07/2023 Uncompleted Private
5 99 yrs from 26/10/2021 Uncompleted Private
Postal.Code Postal.District Postal.Sector Planning.Region Planning.Area
1 266221 10 26 Central Region Bukit Timah
2 439826 15 43 Central Region Marine Parade
3 399738 14 39 Central Region Geylang
4 668159 23 66 West Region Bukit Batok
5 788888 26 78 North East Region Ang Mo Kio
Show the code
head(files5,5) Project.Name Transacted.Price.... Area..SQFT. Unit.Price....PSF.
1 THE LANDMARK 2,726,888 1,076.4 2,533
2 POLLEN COLLECTION 3,850,000 1,808.35 2,129
3 SKY EDEN@BEDOK 2,346,000 1,087.16 2,158
4 TERRA HILL 2,190,000 807.3 2,713
5 PINETREE HILL 1,954,000 796.54 2,453
Sale.Date Address Type.of.Sale Type.of.Area Area..SQM.
1 01 Jan 2024 173 CHIN SWEE ROAD #22-11 New Sale Strata 100
2 01 Jan 2024 34 POLLEN PLACE New Sale Land 168
3 01 Jan 2024 1 BEDOK CENTRAL #09-10 New Sale Strata 101
4 01 Jan 2024 18A YEW SIANG ROAD #03-12 New Sale Strata 75
5 01 Jan 2024 36 PINE GROVE #05-18 New Sale Strata 74
Unit.Price....PSM. Nett.Price... Property.Type Number.of.Units
1 27,269 - Condominium 1
2 22,917 - Terrace House 1
3 23,228 - Apartment 1
4 29,200 - Apartment 1
5 26,405 - Condominium 1
Tenure Completion.Date Purchaser.Address.Indicator
1 99 yrs from 28/08/2020 Uncompleted Private
2 99 yrs from 09/12/2019 Uncompleted N.A
3 99 yrs from 05/01/2022 Uncompleted HDB
4 Freehold Uncompleted N.A
5 99 yrs from 12/09/2022 Uncompleted Private
Postal.Code Postal.District Postal.Sector Planning.Region Planning.Area
1 169878 3 16 Central Region Outram
2 807233 28 80 North East Region Serangoon
3 469657 16 46 East Region Bedok
4 118992 5 11 Central Region Queenstown
5 598444 21 59 Central Region Bukit Timah
2.3.1 Union Dataset
This code checks if the column names of the five datasets are consistent.
all_colnames <- c(names(files1), names(files2), names(files3), names(files4), names(files5))
if(length(unique(all_colnames)) != length(names(files1))) {
stop("Column names are different, files need adjustment to have consistent column names.")
}This code checks if the data types of the five datasets are consistent.
for(col in names(files1)) {
if(!all(sapply(list(files1, files2, files3, files4, files5), function(x) class(x[[col]]) == class(files1[[col]])))) {
stop("Data types are different, files need adjustment to have consistent data types.")
}
}The five files can be merged successfully. They have consistent column names, data types, and equal row counts, making them compatible for union operation.
uniondata <- rbind(files1, files2, files3, files4, files5)2.3.2 Check Dataset
Checking the structure of data using str():
str(uniondata)'data.frame': 26806 obs. of 21 variables:
$ Project.Name : chr "THE REEF AT KING'S DOCK" "URBAN TREASURES" "NORTH GAIA" "NORTH GAIA" ...
$ Transacted.Price.... : chr "2,317,000" "1,823,500" "1,421,112" "1,258,112" ...
$ Area..SQFT. : chr "882.65" "882.65" "1,076.4" "1,033.34" ...
$ Unit.Price....PSF. : chr "2,625" "2,066" "1,320" "1,218" ...
$ Sale.Date : chr "01 Jan 2023" "02 Jan 2023" "02 Jan 2023" "02 Jan 2023" ...
$ Address : chr "12 HARBOURFRONT AVENUE #05-32" "205 JALAN EUNOS #08-02" "29 YISHUN CLOSE #08-10" "45 YISHUN CLOSE #07-42" ...
$ Type.of.Sale : chr "New Sale" "New Sale" "New Sale" "New Sale" ...
$ Type.of.Area : chr "Strata" "Strata" "Strata" "Strata" ...
$ Area..SQM. : chr "82" "82" "100" "96" ...
$ Unit.Price....PSM. : chr "28,256" "22,238" "14,211" "13,105" ...
$ Nett.Price... : chr "-" "-" "-" "-" ...
$ Property.Type : chr "Condominium" "Condominium" "Executive Condominium" "Executive Condominium" ...
$ Number.of.Units : int 1 1 1 1 1 1 1 1 1 1 ...
$ Tenure : chr "99 yrs from 12/01/2021" "Freehold" "99 yrs from 15/02/2021" "99 yrs from 15/02/2021" ...
$ Completion.Date : chr "Uncompleted" "Uncompleted" "Uncompleted" "Uncompleted" ...
$ Purchaser.Address.Indicator: chr "HDB" "Private" "HDB" "HDB" ...
$ Postal.Code : int 97996 419535 269343 269294 797391 548742 275780 423510 126751 738086 ...
$ Postal.District : int 4 14 27 27 28 19 10 15 5 25 ...
$ Postal.Sector : int 9 41 26 26 79 54 27 42 12 73 ...
$ Planning.Region : chr "Central Region" "East Region" "North Region" "North Region" ...
$ Planning.Area : chr "Bukit Merah" "Bedok" "Yishun" "Yishun" ...
using is.na we see there are no missing values in the data
for(column_name in names(uniondata)){
na_count<- sum(is.na(uniondata[[column_name]]))
}
na_count[1] 0
Using the duplicated function, we see that there are no duplicate entries in the data.
uniondata[duplicated(uniondata),] [1] Project.Name Transacted.Price....
[3] Area..SQFT. Unit.Price....PSF.
[5] Sale.Date Address
[7] Type.of.Sale Type.of.Area
[9] Area..SQM. Unit.Price....PSM.
[11] Nett.Price... Property.Type
[13] Number.of.Units Tenure
[15] Completion.Date Purchaser.Address.Indicator
[17] Postal.Code Postal.District
[19] Postal.Sector Planning.Region
[21] Planning.Area
<0 rows> (or 0-length row.names)
2.3.3 data set description
| Field | Description |
|---|---|
| Project.Name | The name of the housing project or development. |
| Transacted.Price | The price at which the property was transacted. |
| Area (SQFT) | The area of the property in square feet. |
| Unit.Price (PSF) | The price per square foot of the property. |
| Sale.Date | The date on which the property was sold. |
| Address | The address of the property. |
| Type.of.Sale | The type of sale, e.g., resale, new sale, sub sale. |
| Type.of.Area | The type of area, e.g., built-up area, land area. |
| Area (SQM) | The area of the property in square meters. 1SQM=10.764SQFT |
| Unit.Price (PSM) | The price per square meter of the property. |
| Nett.Price | The final price after deductions or adjustments. |
| Property.Type | The type of property, e.g., condominium, landed house, HDB flat. |
| Number.of.Units | The number of units in the housing development. |
| Tenure | The land tenure of the property, e.g., freehold, leasehold. |
| Completion.Date | The date when the property construction was completed. |
| Purchaser.Address.Indicator | It refers to the type of residence (HDB flat or private property) of the purchaser’s address as shown in the caveat, and is not indicative of whether the place of residence is owned by the purchaser. Where the information is not available, ‘N.A’ is indicated. |
| Postal.Code | The postal code of the property. |
| Postal.District | The postal district of the property. |
| Postal.Sector | The postal sector of the property. |
| Planning.Region | Singapore is divided into 5 areas called planning regions to facilitate the planning of the use and development of land for the whole of Singapore. The 5 regions are Central Region, East Region, North East Region, North Region and West Region. Each Planning Region consists of several planning areas which are further divided into sub-zones. A Development Guide Plan (DGP) is prepared for each planning area (see Development Guide Plan). |
| Planning.Area | The planning area where the property is located. |
2.3.4 Exploration of Dataset Features
The following code snippet is used to explore categorical data features, including counting and identifying unique values for different features. This will help us better understand the structure and content of the dataset, providing a foundation for further exploration and analysis.
code
library(dplyr)
# Convert Type.of.Sale
unique_type_of_sale <- unique(uniondata$Type.of.Sale)
table_type_of_sale <- table(uniondata$Type.of.Sale)
df_type_of_sale <- data.frame(Type_of_Sale = unique_type_of_sale, Count = as.vector(table_type_of_sale))
# Convert Type.of.Area
unique_type_of_area <- unique(uniondata$Type.of.Area)
table_type_of_area <- table(uniondata$Type.of.Area)
df_type_of_area <- data.frame(Type_of_Area = unique_type_of_area, Count = as.vector(table_type_of_area))
# Convert Property.Type
unique_property_type <- unique(uniondata$Property.Type)
table_property_type <- table(uniondata$Property.Type)
df_property_type <- data.frame(Property_Type = unique_property_type, Count = as.vector(table_property_type))
# Convert Number.of.Units
unique_number_of_units <- unique(uniondata$Number.of.Units)
table_number_of_units <- table(uniondata$Number.of.Units)
df_number_of_units <- data.frame(Number_of_Units = unique_number_of_units, Count = as.vector(table_number_of_units))
# Convert Completion.Date
unique_completion_date <- unique(uniondata$Completion.Date)
table_completion_date <- table(uniondata$Completion.Date)
df_completion_date <- data.frame(Completion_Date = unique_completion_date, Count = as.vector(table_completion_date))
# Convert Purchaser.Address.Indicator
unique_purchaser_address_indicator <- unique(uniondata$Purchaser.Address.Indicator)
table_purchaser_address_indicator <- table(uniondata$Purchaser.Address.Indicator)
df_purchaser_address_indicator <- data.frame(Purchaser_Address_Indicator = unique_purchaser_address_indicator, Count = as.vector(table_purchaser_address_indicator))
# Convert Postal.District
unique_postal_district <- unique(uniondata$Postal.District)
table_postal_district <- table(uniondata$Postal.District)
df_postal_district <- data.frame(Postal_District = unique_postal_district, Count = as.vector(table_postal_district))
# Convert Postal.Sector
unique_postal_sector <- unique(uniondata$Postal.Sector)
table_postal_sector <- table(uniondata$Postal.Sector)
df_postal_sector <- data.frame(Postal_Sector = unique_postal_sector, Count = as.vector(table_postal_sector))
# Convert Planning.Region
unique_planning_region <- unique(uniondata$Planning.Region)
table_planning_region <- table(uniondata$Planning.Region)
df_planning_region <- data.frame(Planning_Region = unique_planning_region, Count = as.vector(table_planning_region))
# Convert Planning.Area
unique_planning_area <- unique(uniondata$Planning.Area)
table_planning_area <- table(uniondata$Planning.Area)
df_planning_area <- data.frame(Planning_Area = unique_planning_area, Count = as.vector(table_planning_area))
# Convert Planning.Area
unique_Nett.Price <- unique(uniondata$Nett.Price)
table_Nett.Price <- table(uniondata$Nett.Price)
df_Nett.Price <- data.frame(Nett.Price = unique_Nett.Price, Count = as.vector(table_Nett.Price))df_type_of_sale Type_of_Sale Count
1 New Sale 8644
2 Resale 16617
3 Sub Sale 1545
df_type_of_area Type_of_Area Count
1 Strata 1570
2 Land 25236
df_property_type Property_Type Count
1 Condominium 10772
2 Executive Condominium 10643
3 Terrace House 233
4 Semi-Detached House 3534
5 Apartment 524
6 Detached House 1100
df_number_of_units Number_of_Units Count
1 1 26794
2 43 5
3 4 1
4 60 2
5 3 1
6 2 1
7 11 1
8 7 1
df_completion_date Completion_Date Count
1 Uncompleted 682
2 2022 1
3 - 2
4 2015 1
5 1941 1
6 2000 1
7 1958 3
8 2014 1
9 2011 5
10 2009 8
11 2008 5
12 1998 4
13 2017 6
14 2018 1
15 1997 8
16 1990 3
17 2019 2
18 2010 5
19 2003 4
20 2016 4
21 1999 4
22 2001 12
23 1995 10
24 2007 4
25 1994 5
26 2012 2
27 1991 13
28 1976 11
29 2006 10
30 2005 52
31 1996 56
32 2002 42
33 2013 70
34 1967 30
35 2021 37
36 2004 51
37 1980 26
38 1986 60
39 1983 55
40 1977 85
41 1993 95
42 1988 20
43 1978 20
44 2020 40
45 1975 58
46 1992 48
47 2023 85
48 1982 84
49 1989 193
50 1981 148
51 1979 236
52 1954 433
53 1985 424
54 1984 461
55 1973 485
56 1968 192
57 1972 245
58 1964 265
59 1955 419
60 1961 355
61 1970 196
62 1956 226
63 1957 367
64 1969 376
65 1952 398
66 1987 522
67 1974 419
68 1963 702
69 1960 1247
70 1962 1172
71 1931 1440
72 0216 1293
73 1965 1287
74 1966 457
75 1951 173
76 2024 344
77 1971 712
78 1953 1032
79 1940 3
80 1959 8752
df_purchaser_address_indicator Purchaser_Address_Indicator Count
1 HDB 8719
2 Private 2733
3 N.A 15354
df_postal_district Postal_District Count
1 4 181
2 14 295
3 27 875
4 28 338
5 19 1195
6 10 9
7 15 303
8 5 277
9 25 1003
10 18 1682
11 8 826
12 9 548
13 11 567
14 16 940
15 1 2763
16 20 998
17 22 357
18 3 1480
19 23 2920
20 7 696
21 21 1449
22 17 899
23 12 2662
24 13 18
25 6 322
26 2 1365
27 24 1270
28 26 568
df_postal_sector Postal_Sector Count
1 9 112
2 41 9
3 26 60
4 79 219
5 54 76
6 27 286
7 42 52
8 12 267
9 73 593
10 43 335
11 53 248
12 52 241
13 45 386
14 21 9
15 22 244
16 30 59
17 82 115
18 76 162
19 46 584
20 1 419
21 48 390
22 57 415
23 64 421
24 51 458
25 14 249
26 67 66
27 75 511
28 66 91
29 25 381
30 18 76
31 55 97
32 59 347
33 47 123
34 50 246
35 32 269
36 65 152
37 13 273
38 68 459
39 39 1962
40 24 155
41 23 187
42 35 704
43 58 157
44 17 137
45 36 3
46 56 354
47 44 478
48 15 1002
49 38 1205
50 40 827
51 16 435
52 28 139
53 7 557
54 20 879
55 80 570
56 33 423
57 69 141
58 29 3
59 19 332
60 8 861
61 61 413
62 34 817
63 6 571
64 31 18
65 60 322
66 11 212
67 78 1056
68 77 17
69 10 1348
70 5 324
71 63 244
72 49 453
df_planning_region Planning_Region Count
1 Central Region 12200
2 East Region 3644
3 North Region 5037
4 North East Region 1625
5 West Region 4300
df_planning_area Planning_Area Count
1 Bukit Merah 1558
2 Bedok 1753
3 Yishun 460
4 Sengkang 1314
5 Hougang 610
6 Bukit Timah 828
7 Marine Parade 2344
8 Clementi 3
9 Woodlands 594
10 Serangoon 647
11 Tanglin 587
12 Tampines 1227
13 Kallang 1209
14 Rochor 531
15 Novena 368
16 Punggol 681
17 Sembawang 25
18 Downtown Core 1551
19 Bishan 65
20 Jurong West 348
21 Pasir Ris 931
22 Queenstown 102
23 Bukit Panjang 251
24 Bukit Batok 832
25 Museum 1
26 Newton 453
27 Southern Islands 945
28 Toa Payoh 374
29 Choa Chu Kang 209
30 Geylang 502
31 River Valley 836
32 Orchard 981
33 Singapore River 140
34 Outram 106
35 Tengah 2
36 Ang Mo Kio 1055
37 Jurong East 612
38 Mandai 18
39 Sungei Kadut 657
40 Changi 320
41 Paya Lebar 776
df_Nett.Price Nett.Price Count
1 - 26770
2 2,129,200 1
3 2,018,062 1
4 2,224,300 1
5 1,752,240 1
6 2,289,600 1
7 2,307,600 1
8 2,325,600 1
9 2,397,600 1
10 2,433,600 1
11 2,451,600 1
12 3,960,900 1
13 1,506,600 1
14 3,987,900 1
15 2,335,888 1
16 2,343,600 1
17 3,697,000 1
18 3,180,900 1
19 1,941,900 1
20 2,137,192 1
21 3,050,000 1
22 3,445,100 1
23 1,935,070 1
24 1,606,400 1
25 11,712,000 1
26 14,212,000 1
27 2,040,900 1
28 14,463,000 1
29 14,358,000 1
30 1,950,000 1
31 14,124,000 1
32 12,203,000 1
33 1,990,000 1
34 1,558,579 1
35 11,795,000 1
36 1,628,000 1
37 1,492,600 1
Before diving into detailed data exploration, it’s essential to understand the key numerical variables in our dataset. The summary() function provides a concise summary of these variables, including measures of central tendency, dispersion, and distribution.
summary(uniondata$Transacted.Price....) Length Class Mode
26806 character character
summary(uniondata$Unit.Price....PSF.) Length Class Mode
26806 character character
summary(uniondata$Area..SQFT.) Length Class Mode
26806 character character
Summary of the Dateset findings:
Upon examining the data’s structure, missing values, and duplicates, it appears that the data is tidy and does not require further processing.
The information contained in the “Project.Name” and “Address” columns appears overly intricate and does not significantly contribute to the analysis. Therefore, these columns can be safely removed.
Both “SQM” and “SQFT” serve as units of measurement for area. It suffices to retain one of them for analysis purposes.
The “Tenure” information can be simplified by retaining only the duration in years and removing the specific date portion, facilitating clearer analysis.
Given considerations for visualization, the “Postal” information is redundant. It is suggested to eliminate the “Postal.Sector” column.
Anomalies such as the presence of values like 60 and 43 in the “Number of Units” column warrant further investigation to ascertain whether they are indeed outliers or erroneous entries.
Upon inspecting the “Nett.Price” column, it is observed that only 27 rows contain specific numerical values, while the rest are empty. Hence, this feature proves ineffective and can be safely discarded.
This concise summary encapsulates the key findings of the EDA process, laying the groundwork for further analysis and decision-making.
3 Data preparation
3.1 Recast variable
Transacted.Price….,Transacted.Price….,Area..SQFT. are of type chr,As they should be numeric, they will be cast as num class.
# Remove commas and convert to numeric
uniondata$Transacted.Price.... <- as.numeric(gsub(",", "", uniondata$Transacted.Price....))
uniondata$Area..SQFT. <- as.numeric(gsub(",", "", uniondata$Area..SQFT.))
uniondata$Unit.Price....PSF. <- as.numeric(gsub(",", "", uniondata$Unit.Price....PSF.))3.2 Create new variables
From the original Tenure column, extracting the numeric part, removing non-numeric characters, and classifying based on the extracted numeric values.
uniondata <- uniondata %>%
mutate(
Tenure_cleaned = gsub("[^0-9]", "", Tenure),
Tenure_numeric = as.numeric(substr(Tenure_cleaned, 1, 3)),
Tenure_level = case_when(
grepl("Freehold", Tenure) ~ "Freehold",
Tenure_numeric < 200 ~ "< 200",
TRUE ~ "> 200"
)
)
unique(uniondata$Tenure_level)[1] "> 200" "Freehold" "< 200"
3.3 Segmenting Variables into Groups
Creates boxplots with summary statistics, aiming to understand the distribution and characteristics of these variables visually.
filter_area <- uniondata %>%
filter(Area..SQFT. < 5000)
library(ggplot2)
library(patchwork)
# Calculate summary statistics for Area..SQFT.
mean_val <- mean(filter_area$`Area..SQFT.`)
median_val <- median(filter_area$`Area..SQFT.`)
q1_val <- quantile(filter_area$`Area..SQFT.`, 0.25)
q3_val <- quantile(filter_area$`Area..SQFT.`, 0.75)
# Create boxplot for Area..SQFT.
p1 <- ggplot(filter_area, aes(x = "", y = `Area..SQFT.`)) +
geom_boxplot(fill = "#1a80bb") + # Fill boxplot with color
stat_summary(fun = "mean", geom = "point", shape = 8,
size = 2, color = "#fefefe") +
annotate("text", x = 0.8, y = max(filter_area$`Area..SQFT.`),
label = paste("Mean:", round(mean_val, 2), "\n",
"Median:", round(median_val, 2), "\n",
"Q1:", round(q1_val, 2), "\n",
"Q3:", round(q3_val, 2)),
vjust = 1, hjust = 0.5, size = 3) +
labs(y = "Area (SQFT)", x="",title = "Boxplot of Area (SQFT)") +
theme_minimal()+
theme( text = element_text(size = 8))
# Calculate summary statistics for Unit.Price....PSF.
mean_val_psf <- mean(uniondata$`Unit.Price....PSF.`)
median_val_psf <- median(uniondata$`Unit.Price....PSF.`)
q1_val_psf <- quantile(uniondata$`Unit.Price....PSF.`, 0.25)
q3_val_psf <- quantile(uniondata$`Unit.Price....PSF.`, 0.75)
# Create boxplot for Unit.Price....PSF.
p2 <- ggplot(uniondata, aes(x = "", y = `Unit.Price....PSF.`)) +
geom_boxplot(fill = "#1a80bb") + # Fill boxplot with color
stat_summary(fun = "mean", geom = "point", shape = 8,
size = 2, color = "#fefefe") +
annotate("text", x = 0.8, y = max(uniondata$`Unit.Price....PSF.`),
label = paste("Mean:", round(mean_val_psf, 2), "\n",
"Median:", round(median_val_psf, 2), "\n",
"Q1:", round(q1_val_psf, 2), "\n",
"Q3:", round(q3_val_psf, 2)),
vjust = 1, hjust = 0.5, size = 3) +
labs(y = "Price (PSF)", x="", title = "Boxplot of Price (PSF)") +
theme_minimal()+
theme( text = element_text(size = 8))
# Combine the plots
(p1 + p2)Based on the boxplot results, the Area and Price need to be grouped.
They are divided into four groups: Small, Medium, Large, and plus, based on the first quartile (Q1), median, third quartile (Q3), and outliers (values beyond 1.5 times the interquartile range, IQR).
These groupings help categorize the data into different size categories, providing insights into the distribution and variability of the Area and Price variables.
# Grouping the Area variable
uniondata$Area_Group <- cut(uniondata$`Area..SQFT.`,
breaks = c(-Inf, quantile(uniondata$`Area..SQFT.`, 0.25),
quantile(uniondata$`Area..SQFT.`, 0.5),
quantile(uniondata$`Area..SQFT.`, 0.75),
Inf),
labels = c("Small", "Medium", "Large", "Plus"))
# Grouping the Price variable
uniondata$Price_Group <- cut(uniondata$`Unit.Price....PSF.`,
breaks = c(-Inf, quantile(uniondata$`Unit.Price....PSF.`, 0.25),
quantile(uniondata$`Unit.Price....PSF.`, 0.5),
quantile(uniondata$`Unit.Price....PSF.`, 0.75),
Inf),
labels = c("Low", "Medium-Low", "Medium-High", "High"))3.4 Deciding on Plausible Values to Use
Through exploring the dataset in the previous step, removing irrelevant or redundant columns can facilitate the subsequent analysis. Additionally, the analysis of the dataset is limited to the first quarter of 2024.
uniondata$Sale.Date <- dmy(uniondata$Sale.Date)
cols_to_exclude <- c("Project.Name", "Nett.Price...", "Address", "Area..SQM.","Unit.Price....PSM.","Postal.Code","Postal.Sector","Tenure_cleaned","Tenure_numeric","Tenure")
data2024 <- uniondata %>%
select(-any_of(cols_to_exclude))
head(data2024) Transacted.Price.... Area..SQFT. Unit.Price....PSF. Sale.Date Type.of.Sale
1 2317000 882.65 2625 2023-01-01 New Sale
2 1823500 882.65 2066 2023-01-02 New Sale
3 1421112 1076.40 1320 2023-01-02 New Sale
4 1258112 1033.34 1218 2023-01-02 New Sale
5 1280000 871.88 1468 2023-01-03 Resale
6 5870000 3322.85 1767 2023-01-03 Resale
Type.of.Area Property.Type Number.of.Units Completion.Date
1 Strata Condominium 1 Uncompleted
2 Strata Condominium 1 Uncompleted
3 Strata Executive Condominium 1 Uncompleted
4 Strata Executive Condominium 1 Uncompleted
5 Strata Condominium 1 2022
6 Land Terrace House 1 -
Purchaser.Address.Indicator Postal.District Planning.Region Planning.Area
1 HDB 4 Central Region Bukit Merah
2 Private 14 East Region Bedok
3 HDB 27 North Region Yishun
4 HDB 27 North Region Yishun
5 HDB 28 North East Region Sengkang
6 Private 19 North East Region Hougang
Tenure_level Area_Group Price_Group
1 > 200 Medium High
2 Freehold Medium Medium-High
3 > 200 Large Low
4 > 200 Large Low
5 > 200 Medium Medium-Low
6 > 200 Plus Medium-High
str(data2024
)'data.frame': 26806 obs. of 16 variables:
$ Transacted.Price.... : num 2317000 1823500 1421112 1258112 1280000 ...
$ Area..SQFT. : num 883 883 1076 1033 872 ...
$ Unit.Price....PSF. : num 2625 2066 1320 1218 1468 ...
$ Sale.Date : Date, format: "2023-01-01" "2023-01-02" ...
$ Type.of.Sale : chr "New Sale" "New Sale" "New Sale" "New Sale" ...
$ Type.of.Area : chr "Strata" "Strata" "Strata" "Strata" ...
$ Property.Type : chr "Condominium" "Condominium" "Executive Condominium" "Executive Condominium" ...
$ Number.of.Units : int 1 1 1 1 1 1 1 1 1 1 ...
$ Completion.Date : chr "Uncompleted" "Uncompleted" "Uncompleted" "Uncompleted" ...
$ Purchaser.Address.Indicator: chr "HDB" "Private" "HDB" "HDB" ...
$ Postal.District : int 4 14 27 27 28 19 10 15 5 25 ...
$ Planning.Region : chr "Central Region" "East Region" "North Region" "North Region" ...
$ Planning.Area : chr "Bukit Merah" "Bedok" "Yishun" "Yishun" ...
$ Tenure_level : chr "> 200" "Freehold" "> 200" "> 200" ...
$ Area_Group : Factor w/ 4 levels "Small","Medium",..: 2 2 3 3 2 4 4 4 1 4 ...
$ Price_Group : Factor w/ 4 levels "Low","Medium-Low",..: 4 3 1 1 2 3 1 3 3 1 ...
4 Exploratory Data Analysis
4.1 EDA1 Real Estate Market Trend Analysis
`summarise()` has grouped output by 'ym'. You can override using the `.groups`
argument.
`summarise()` has grouped output by 'Date'. You can override using the
`.groups` argument.
Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.
library(tidyverse)
# Group data by month and year, calculate average house price
df_avg <- uniondata %>%
filter(Purchaser.Address.Indicator != "N.A") %>%
mutate(ym = format(Sale.Date, "%Y-%m")) %>%
group_by(ym, Purchaser.Address.Indicator) %>%
summarise(avg_price = mean(Transacted.Price....))
# Plot 1
plot1 <- ggplot(df_avg, aes(x = ym, y = avg_price, fill = Purchaser.Address.Indicator)) +
geom_col(position = position_dodge()) +
scale_fill_manual(values = c("#082a54", "#f0c571"), name = "Property type") +
geom_hline(yintercept = mean(df_avg$avg_price), color = "red", linetype = "dashed") +
geom_text(aes(x = 1, y = mean(df_avg$avg_price),
label = paste("Avg Price:", round(mean(df_avg$avg_price), 2))),
color = "red", size = 3, hjust = -0.2, vjust = -0.5) +
labs(x = "Month", y = "Price(PSF)", title = "Price (PSF) of HDB by Property type and Month") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none",
text = element_text(size = 8))
# Count sales by month and property type
df_count <- uniondata %>%
mutate(Date = format(Sale.Date, "%Y-%m")) %>%
filter(Purchaser.Address.Indicator != "N.A") %>%
group_by(Date, Purchaser.Address.Indicator) %>%
summarise(Sales = n()) %>%
group_by(Date) %>%
mutate(Percentage = Sales / sum(Sales))
# Plot 2
plot2 <- ggplot(df_count, aes(x = Date, y = Sales, fill = Purchaser.Address.Indicator)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scales::percent(Percentage, accuracy = 1)),
position = position_stack(vjust = 0.5),
size = 3,
color = "#4A5859") +
scale_fill_manual(values = c("#1a80bb", "#f2c45f"), name = "Property type") +
labs(x = "Month", y = "Sales", title = "Monthly Sales by Property type") +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = c(0.8, 0.9), legend.title = element_blank(),
legend.key.size = unit(0.4, 'cm'), legend.key.height = unit(0.4, 'cm'),
legend.key.width = unit(0.4, 'cm'), text = element_text(size = 8))
# Combine plots
plot1 + plot2Based on the provided charts, we ca analyze the recent price trends and sales situation in different types of properties as follow:
Price Trends:
Prices for public housing (HDB) have remained stable, averaging between 1700 to 1800 SGD per square foot monthly. In contrast, private real estate prices fluctuate more and are consistently higher than HDB prices.
The long-term average price for HDB units is lower than the overall average, reflecting its affordability and welfare nature.
Prices in the first quarter of 2024 have decreased compared to 2023, especially for private real estate.
Sales Situation:
Monthly sales volumes for both HDB and private real estate maintain a ratio of around 60% to 40%, showing stable market demand.
The surge in HDB sales volume in 2023 has returned to normal levels in the first quarter of 2024.
Despite fluctuations, HDB sales consistently account for around 60% of total sales, highlighting its importance in meeting housing needs.
4.2 EDA2 Price Distribution of Housing in Five Regions of Singapore
$x
[1] "Region"
$y
[1] "Price(PSF)"
attr(,"class")
[1] "labels"
Picking joint bandwidth of 60.4
library(ggplot2)
library(ggridges)
# Calculate median values
median_values <- data2024 %>%
group_by(Planning.Region) %>%
summarise(median_value = median(Unit.Price....PSF.))
# Recode Planning.Region
data2024 <- data2024 %>%
mutate(Planning.Region = recode(Planning.Region,
"Central Region" = "Central",
"East Region" = "East",
"North Region" = "North",
"North East Region" = "North East",
"West Region" = "West"
))
# Plot 1
Plot1 <- ggplot(data2024, aes(Unit.Price....PSF., Planning.Region)) +
geom_density_ridges(aes(fill = Planning.Region), alpha = 0.2, quantile_lines = TRUE, quantile_fun = function(x, ...) median(x)) +
geom_text(data = median_values, aes(x = median_value - 200, y = Planning.Region, label = paste("Median Value:", round(median_value, 2))),
hjust = -0.2, vjust = -0.5, color = "black", size = 3) +
theme_minimal() +
theme(legend.position = "bottom",
text = element_text(size = 8),
plot.title = element_text(size = 12, face = "bold"),
legend.title = element_blank(),
legend.key.size = unit(0.4, 'cm'),
legend.key.height = unit(0.4, 'cm'),
legend.key.width = unit(0.4, 'cm')) +
labs(x = "Price(PSF)", y = "Region", title = "Unit Price Distribution by Region")
# Plot 2
Plot2 <- data2024 %>%
filter(Purchaser.Address.Indicator != "N.A") %>%
ggplot(aes(Planning.Region, Unit.Price....PSF., fill = Purchaser.Address.Indicator)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
text = element_text(size = 8),
legend.title = element_blank()) +
labs(x = "Region", y = "Price(PSF)")
# Combine plots
Plot1 + Plot2Based on the chart, we can analyze the price distribution of housing in the five regions of Singapore and different types of properties as follows:
The median price in the Central region is the highest at $2,224 per square foot, while the West region has the lowest median price at $1,529 per square foot. This may reflect the high-end positioning and high demand for properties in the city center.
The price distribution follows a normal distribution curve, but the peak and spread of each region’s distribution vary. The curves for the Central and West regions are steeper, indicating that prices are more concentrated in these areas, while those for the Northeast, North, and East regions are flatter, indicating greater price variation.
The box plots show significant price variations within each region, especially in the Northeast and West regions, where the price dispersion is the highest.
Private properties generally have higher prices than public housing (HDB), especially in the West and North regions, where the price gap between the two types of properties is significant. This aligns with the independent ownership and higher positioning of private properties.
Even within the same region, there are significant price differences between private properties and HDB flats, reflecting the diverse nature of the Singapore real estate market.
4.3 EDA3 Comparison of Unit Price and Area for Different Property Types in Singapore
`summarise()` has grouped output by 'Purchaser.Address.Indicator'. You can
override using the `.groups` argument.
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
df_Typeprice <- uniondata %>%
filter(Purchaser.Address.Indicator !="N.A") %>%
group_by(Purchaser.Address.Indicator, Property.Type) %>%
summarise(Price = median(Unit.Price....PSF.),Area=median( Area..SQFT.))
# Loading required packages
library(tidyverse)
library(scales)
library(ggrepel)
# Plot
# Plot
ggplot(df_Typeprice, aes(x = Area, y = Price)) +
geom_point(aes(size = Price, color = Purchaser.Address.Indicator), alpha = 0.6) +
geom_text_repel(aes(label = Property.Type), size = 2.5) +
scale_color_manual(values = c("#C83E4D", "#4A5859"), name = "Property Type") + # Change legend title
labs(title = "Comparison of Area and Unit Price by Property Type", x = "Area", y = "Price (PSF)") + # Change axis labels
theme(text = element_text(size = 10),
plot.title = element_text(size = 12, face = "bold"),
legend.key.size = unit(0.5, 'cm'),
legend.key.height = unit(0.5, 'cm'),
legend.key.width = unit(0.5, 'cm'),
legend.title = element_text(size = 8)) Based on the chart data, we can can analyze the unit area prices and sizes for different types of properties as follows:
Private properties have higher prices than public housing (HDB), reflecting their nature and market positioning.
There are significant price differences among different types of private properties, with detached houses being the highest priced and condominiums being relatively lower.
Public housing (HDB) shows less variation in prices across different types, reflecting uniform price control and affordability.
Detached and semi-detached houses generally have larger sizes, while apartments and condominiums have smaller sizes.
Executive condominiums have the lowest unit area prices among different types of apartments, likely due to their mixed characteristics of private apartments and affordable housing.
4.4 EDA4 Regional Deviation in Unit Prices Across Singapore
library(ggplot2)
# Private
df_Aavg <- uniondata %>%
filter(Purchaser.Address.Indicator == "Private") %>%
group_by(Planning.Area) %>%
summarise(avg_Aprice = mean(Unit.Price....PSF.))
df_Aavg$p_z <- round((df_Aavg$avg_Aprice - mean(df_Aavg$avg_Aprice)) / sd(df_Aavg$avg_Aprice), 2)
df_Aavg$p_ztype <- ifelse(df_Aavg$p_z < 0, "below", "above")
df_Aavg <- df_Aavg[order(df_Aavg$p_z), ]
df_Aavg$Planning.Area <- factor(df_Aavg$Planning.Area, levels = df_Aavg$Planning.Area)
plot11 <- ggplot(df_Aavg, aes(x = Planning.Area, y = p_z, label = p_z)) +
geom_bar(stat = "identity", aes(fill = p_ztype), position = position_dodge2(width = 2), width = 0.8) +
scale_fill_manual(name = "Average Price", labels = c("Above Average", "Below Average"), values = c("below" = "#C83E4D", "above" = "#4A5859")) +
labs(title = "Unit Price Deviations by Area", y = "", subtitle = "Private") +
coord_flip() +
theme(legend.position = "None", text = element_text(size = 8), plot.title = element_text(size = 12, face = "bold"))
# HDB
df_AHavg <- uniondata %>%
filter(Purchaser.Address.Indicator == "HDB") %>%
group_by(Planning.Area) %>%
summarise(avg_Aprice = mean(Unit.Price....PSF.))
df_AHavg$p_z <- round((df_AHavg$avg_Aprice - mean(df_AHavg$avg_Aprice)) / sd(df_AHavg$avg_Aprice), 2)
df_AHavg$p_ztype <- ifelse(df_AHavg$p_z < 0, "below", "above")
df_AHavg <- df_AHavg[order(df_AHavg$p_z), ]
df_AHavg$Planning.Area <- factor(df_AHavg$Planning.Area, levels = df_AHavg$Planning.Area)
plot22 <- ggplot(df_AHavg, aes(x = Planning.Area, y = p_z, label = p_z)) +
geom_bar(stat = "identity", aes(fill = p_ztype), position = position_dodge2(width = 2), width = 0.8) +
scale_fill_manual(name = "Average Price", labels = c("Above Average", "Below Average"), values = c("below" = "#C83E4D", "above" = "#4A5859")) +
labs(y = "", x = "", subtitle = "HDB") +
coord_flip() +
theme(text = element_text(size = 8),
legend.title = element_blank(),
legend.position = c(0.25, 0.9),
legend.key.size = unit(0.4, 'cm'),
legend.key.height = unit(0.4, 'cm'),
legend.key.width = unit(0.4, 'cm'))
plot11 + plot22Based on this chart, we can analyze the deviation of unit housing prices in different regions of Singapore:
Private Properties:
Private property prices in Chinatown, Newton, and Orchard Road areas are significantly higher than the average, with the highest deviations. These areas have prime locations, which may be the primary reason.
Suburban areas such as Bukit Panjang, Serangoon, Balestier, and Bedok have private property prices below the average.
HDB Flats:
HDB flat prices in River Valley, Chinatown, and Newton areas are higher than the average, as these areas are close to the city center, which may be the main influencing factor.
Suburban areas such as Balestier, Serangoon, Bedok, and Choa Chu Kang have HDB flat prices generally below the average.